Automatic linguistic segmentation of conversational speech
Abstract
As speech recognition moves toward less constrained domains such as conversational speech, waveforms and recognizer output need to be segmented (or resegmented) into linguistically meaningful units, such as sentences. Toward this end, we present a simple automatic segmenter of transcripts based on N-gram language modeling. We also study the relevance of several word-level features for segmentation performance. Using only word-level information, we achieve 85% recall and 70% precision on linguistic boundary detection.
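As a reading aid for the recall and precision figures, the following minimal sketch shows how the two metrics are commonly computed for boundary detection. It is not taken from the paper: the function name `boundary_precision_recall` and the toy boundary positions are hypothetical, and it assumes boundaries are represented as word indices after which a segment ends.

```python
def boundary_precision_recall(reference, hypothesis):
    """Return (precision, recall) for two collections of boundary positions.

    Assumption (not from the paper): a boundary is identified by the index
    of the word it follows, and a hypothesized boundary counts as correct
    only if the same index appears in the reference.
    """
    ref, hyp = set(reference), set(hypothesis)
    correct = len(ref & hyp)
    # Precision: fraction of hypothesized boundaries that are correct.
    precision = correct / len(hyp) if hyp else 0.0
    # Recall: fraction of reference boundaries that were found.
    recall = correct / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical toy data, for illustration only.
reference_boundaries = {3, 7, 12, 18}
hypothesis_boundaries = {3, 7, 10, 12, 15}

precision, recall = boundary_precision_recall(
    reference_boundaries, hypothesis_boundaries
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this definition, a segmenter with higher recall than precision, as in the reported result, finds most true boundaries but also proposes some that are not in the reference.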
Related resources
Ethnomethodology and Conversational Analysis
In a speech community, people utilize their communicative competence which they have acquired from their society as part of their distinctive sociolinguistic identity. They negotiate and share meanings, because they have commonsense knowledge about the world, and have universal practical reasoning. Their commonsense knowledge is embodied in their language. Thus, not only does social life depend...
"Blind" Speech Segmentation: Automatic Segmentation of Speech without Linguistic Knowledge
A new automatic speech segmentation procedure, called "Blind" speech segmentation, is presented. This procedure allows a speech sample to be segmented into sub-word units without the knowledge of any linguistic information (such as orthographic or phonetic transcription). Hence, this procedure involves finding the optimal number of sub-word segments in the given speech sample, before locatin...
Participant Subjectivity and Involvement as a Basis for Discourse Segmentation
We propose a framework for analyzing episodic conversational activities in terms of expressed relationships between the participants and utterance content. We test the hypothesis that linguistic features which express such properties, e.g. tense, aspect, and person deixis, are a useful basis for automatic intentional discourse segmentation. We present a novel algorithm and test our hypothesis o...
Incorporating linguistic knowledge into automatic dialect identification of Spanish
Automatic dialect identification, like automatic language identification, has often been approached through the use of phonetic frequencies and phonetic sequence modeling. While such statistical systems perform well on language identification problems, they are less adept at the more difficult problem of automatic dialect identification, particularly on short segments of speech. In this paper ...
A prosodically labeled database of spontaneous speech
This paper describes a prosodically labeled database of conversational speech, representing a subset of the Switchboard and Callhome corpora. The prosodic transcription system is a simplification of the ToBI system aimed at phenomena that would be most useful for automatic transcription and linguistic analysis of conversational speech. The transcription method and a distributional analysis of t...